fix(sqlite-provider): use caller-provided key column name#9
fix(sqlite-provider): use caller-provided key column name#9anoop-narang merged 2 commits intomainfrom
Conversation
| a.iter() | ||
| .zip(b.iter()) | ||
| .map(|(x, y)| (x - y) * (x - y)) | ||
| .sum::<f32>() |
There was a problem hiding this comment.
P1 — breaking API change bundled into a bug-fix PR.
Removing .sqrt() changes l2_distance from returning actual Euclidean distance to returning squared L2. While sort order is preserved (ranking is still correct), any user with a threshold-based filter silently gets wrong results:
-- before: this excluded rows farther than 1.5 actual units apart
-- after: this excludes rows farther than sqrt(1.5) ≈ 1.22 actual units apart
WHERE l2_distance(vector, query) < 1.5The consistency argument (match USearch's internal L2sq) is valid, but the right fix is to normalize the index path (apply sqrt to USearch's returned distances) — not to change the UDF's contract. The function name l2_distance strongly implies actual Euclidean distance.
Options:
- Revert this change and apply
sqrtto USearch distances on the index path instead. - If L2sq semantics are intentional, rename to
l2sq_distance(orl2_distance_squared) so callers aren't silently broken.
This change should also be a separate PR from the SQLite key-column fix.
There was a problem hiding this comment.
This PR only contains the sqlite-provider key column fix. The l2_distance sqrt removal was in PR #8 (already merged and squash-merged to main). After rebasing, the diff for this PR contains only the sqlite_provider.rs changes. The bot appears to have reviewed a stale diff before the rebase.
| let sql = format!( | ||
| "SELECT id FROM items WHERE label = 'nonexistent' ORDER BY l2_distance(vector, {Q}) ASC LIMIT 2" | ||
| ); | ||
| let ids = collect_ids(&ctx, &sql).await; |
There was a problem hiding this comment.
P2 — the actual bug (non-row_idx key column in SqliteLookupProvider) has no regression test.
The PR test plan notes that existing tests still use row_idx, which is coincidentally the default. A test that builds a SqliteLookupProvider with a schema whose first field is named something other than row_idx (e.g. _key) and then calls fetch_by_keys would directly exercise the fixed code path and guard against future regressions.
There was a problem hiding this comment.
Added in 6934d3a — test_custom_key_column_name exercises the provider with _key as the key column, testing both fetch_by_keys and projection.
There was a problem hiding this comment.
Review
Issues
P1 — src/udf.rs line 31: breaking semantic change to l2_distance
Removing .sqrt() changes the UDF's return value from actual Euclidean distance to squared L2. Sort order is preserved so ANN ranking is unaffected, but threshold-based filters silently produce wrong results for any threshold ≠ 1.0. This is a public API contract violation bundled into a SQLite bug-fix PR.
P2 — tests/execution.rs: no regression test for the stated bug fix
All SQLite-provider tests use row_idx as the key column, which happened to work before the fix. There is no test that exercises SqliteLookupProvider with a non-row_idx key column (e.g. _key), which is the exact scenario the PR claims to fix.
Action Required
- Either revert the
l2_distancesqrt removal and normalize the index path instead, or rename the UDF tol2sq_distanceto avoid silent breakage for callers with distance thresholds. Either way, split it into a separate PR. - Add a
SqliteLookupProvidertest that uses a key column name other thanrow_idx.
The SqliteLookupProvider previously hardcoded "row_idx" as the key column name in CREATE TABLE and WHERE clauses. This caused errors when callers used a different key column name (e.g. "_key"). Now derives the key column name from the first field in the provided schema, making the provider work with any key column name.
f70c668 to
e49e0d1
Compare
| parquet_col_indices: &[usize], | ||
| ) -> DFResult<Self> { | ||
| // The first field in the schema is the key column (INTEGER PRIMARY KEY). | ||
| let key_col = schema.field(0).name().clone(); |
There was a problem hiding this comment.
Suggestion: all existing tests happen to use row_idx as the key column name, so the only coverage here is "dynamic name produces the same SQL as the old hardcoded string." Consider adding a test in sqlite_provider_test.rs that builds a provider with a schema whose first field is named something other than row_idx (e.g. _key) and confirms that fetch_by_keys returns the correct rows. That would give you a direct regression guard for the actual bug scenario.
There was a problem hiding this comment.
Done — added test_custom_key_column_name that uses _key as the key column name.
Exercises SqliteLookupProvider with "_key" as the key column (the scenario used by runtimedb), verifying both fetch_by_keys and projection work with non-default key column names.
| let keys_vec = keys.to_vec(); | ||
| let pool = self.pool.clone(); | ||
| let table_name = self.table_name.clone(); | ||
| let key_col = self.key_col.clone(); |
There was a problem hiding this comment.
Non-blocking suggestion: fetch_by_keys takes a _key_col: &str parameter (from the PointLookupProvider trait) that is silently ignored here in favour of self.key_col. This was already the pre-existing behaviour in HashKeyProvider, so this PR is consistent, but it means a caller who passes the wrong column name gets no feedback.\n\nConsider a debug_assert_eq!(_key_col, self.key_col) guard, or removing the parameter from the trait entirely in a follow-up.
Summary
row_idxas the key column name in CREATE TABLE and WHERE clausesno such column: row_idxerrors when the caller's schema used a different key column name (e.g._key)Test plan
row_idxas key col — still works since name comes from schema)cargo fmt --checkcleancargo clippy --all-features -- -D warningsclean